Introduction
This is your assignment for Module 4 Putting It All Together, focused
on the material you learned in the lectures and recitation activities on
PCA, Manhattan plots, interactive
plots, and the leftovers
Submission info:
- Please submit this assignment by uploading a knitted
.html to Carmen
- Your headers should be logical and your report and code annotated
with descriptions of what you’re doing. Starting on this assignment, I
will be considering for overall format and readability of your
assignment as part of your grade. I am doing this because the format of
your report will be considered for your final capstone assignment. This
means you should have reasonable headers and header levels,
understandable flow between plots and code, and use Markdown language
when appropriate.
- Make sure you include the Code Download button so that I can see
your code as well
- Customize the YAML and the document so you like how it looks
Remember there are often many ways to reach the same end product. I
have showed you many ways in class to achieve a similar end product, you
only need to show me one of them. As long as your answer is reasonable,
you will get full credit even if its different than what I intended.
This assignment will be due on Wednesday, November 30, 2022, at
11:59pm.
Data
The data
we will be using is the same we used in the ggplot102
recitation that includes information about dog breed trait
information from the American Kennel Club.
Download the data using the code below. Don’t use the code from week
5 recitation.
breed_traits <- readr::read_csv('data/breed_traits_fixed.csv')
trait_description <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/trait_description.csv')
breed_rank_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_rank.csv')
For a little hint, here are the packages I used to complete this
task. Yours might not be exactly the same.
library(tidyverse)
library(factoextra)
library(glue)
library(patchwork)
library(ggrepel)
library(plotly)
library(gghighlight)
1. Principal components analysis (PCA) of American Kennel Club dog
bred trait data (6 pts)
Run a PCA on breed_traits for all of the numeric data
present in that dataset. Create the following plots and make them of
publication quality:
- A scree plot
- A scores plot
- A loadings plot
- A two panel plot that has the scores plot and the scree plot
together
breed_traits_quant <- breed_traits %>%
select(-`Coat Type`, -`Coat Length`)
# run PCA
# no scaling (because all are on the same scale)
# centering is a good idea
trait_pca <- prcomp(breed_traits_quant[,-1],
scale = FALSE,
center = TRUE)
importance <- summary(trait_pca)$importance %>%
as.data.frame()
Scree plot
With fviz_eig()
fviz_eig(trait_pca)

Manually
importance_tidy <- importance %>%
rownames_to_column(var = "measure") %>%
pivot_longer(cols = PC1:PC10,
names_to = "PC",
values_to = "value")
# create a vector with the order we want
my_order <- colnames(importance)
# relevel according to my_order
importance_tidy$PC <- fct_relevel(importance_tidy$PC, levels = my_order)
## Warning: Outer names are only allowed for unnamed scalar atomic inputs
## Warning: 4 unknown levels in `f`: PC11, PC12, PC13, and PC14
# plot
(scree_plot <- importance_tidy %>%
filter(measure == "Proportion of Variance") %>%
ggplot(aes(x = PC, y = value)) +
geom_col(alpha = 0.1, color = "black") +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
labs(x = "Principal component",
y = "Percent variance explained",
title = "Scree plot of dog traits"))

Scores plot
With fviz_pca_ind()
fviz_pca_ind(trait_pca)

Manually
# create a df of trait_pca$x
scores_raw <- as.data.frame(trait_pca$x)
# bind breed name
scores <- bind_cols(breed_traits[,1], # first column where we have breed name
scores_raw)
# create objects indicating percent variance explained by PC1 and PC2
PC1_percent <- round((importance[2,1])*100, # index 2nd row, 1st column, times 100
1) # round to 1 decimal
PC2_percent <- round((importance[2,2])*100, 1)
# plot
(scores_plot <- scores %>%
ggplot(aes(x = PC1, y = PC2)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point(color = "black") +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Scores Plot of American Kennel Club Dog Trait Data"))

Loadings plot
With fviz_pca_var()
fviz_pca_var(trait_pca)

Manually
# grab raw loadings, without any metadata
loadings_raw <- as.data.frame(trait_pca$rotation)
# move rowname to column
loadings <- loadings_raw %>%
rownames_to_column(var = "Trait")
(loadings_plot <- loadings %>%
ggplot(aes(x = PC1, y = PC2, label = Trait)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point() +
geom_label_repel(size = 2.5) +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Loadings Plot of American Kennel Club Dog Trait Data"))

Scree and scores plots
scree_plot + scores_plot

2. Make your PCA plot interactive (2 pts)
Make your PCA scores plot interactive, and so that when you hover
each point, you can see what the name of that dog breed is (and only the
breed of that dog).
scores_plotly <- scores %>%
ggplot(aes(x = PC1, y = PC2, text = Breed)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point(color = "black") +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Scores Plot of American Kennel Club Dog Trait Data")
ggplotly(scores_plotly, tooltip = "text")
3. See how your PCA related to breed popularity (2 pts)
Using breed_traits and breed_rank_all,
label the points that show data for the top 10 dog breeds in 2020 and
color them different from the rest of the points. Your plot does not
need to be interactive.
Try joining the dfs based on Breed
# grab just breed and the rank in 2020
breed_rank_to_join <- breed_rank_all %>%
select(Breed, `2020 Rank`)
# join with breed_traits
joined <- left_join(breed_traits, breed_rank_to_join,
by = "Breed")
# check
knitr::kable(head(joined))
| Retrievers (Labrador) |
5 |
5 |
5 |
4 |
2 |
2 |
Double |
Short |
5 |
5 |
3 |
5 |
5 |
5 |
3 |
4 |
NA |
| French Bulldogs |
5 |
5 |
4 |
3 |
1 |
3 |
Smooth |
Short |
5 |
5 |
3 |
5 |
4 |
3 |
1 |
3 |
NA |
| German Shepherd Dogs |
5 |
5 |
3 |
4 |
2 |
2 |
Double |
Medium |
3 |
4 |
5 |
5 |
5 |
5 |
3 |
5 |
NA |
| Retrievers (Golden) |
5 |
5 |
5 |
4 |
2 |
2 |
Double |
Medium |
5 |
4 |
3 |
5 |
5 |
3 |
1 |
4 |
NA |
| Bulldogs |
4 |
3 |
3 |
3 |
3 |
3 |
Smooth |
Short |
4 |
4 |
3 |
3 |
4 |
3 |
2 |
3 |
5 |
| Poodles |
5 |
5 |
3 |
1 |
4 |
1 |
Curly |
Long |
5 |
5 |
5 |
4 |
5 |
4 |
4 |
5 |
6 |
We have a bunch of NAs in 2020 Rank lets figure out why.
I notice we have NAs in all the Breed names that have spaces
# create a vector of breed from each df
breed_traits_breed <- breed_traits$Breed
breed_rank_breed <- breed_rank_all$Breed
# are they the same?
all.equal(breed_traits_breed, breed_rank_breed) # no, many string mismatches
## [1] "146 string mismatches"
# testing with French Bulldogs bc those are my fav
test1 <- breed_traits_breed[2]
test2 <- breed_rank_breed[2]
library(stringi) # has the function stri_compare()
stri_compare(test1, test2) # 1 string different, but ok which one?
## [1] 1
library(devtools) # allows you to use install_github
## Loading required package: usethis
install_github("renozao/pkgmaker") # install pkgmaker for string comparison
## Skipping install of 'pkgmaker' from a github remote, the SHA1 (3c58d769) has not changed since last install.
## Use `force = TRUE` to force installation
library(pkgmaker)
## Loading required package: registry
str_diff(test1, test2) # the problem is in the space! But what? It looks the same to me
## French Bulldogs
## ......*........
## French Bulldogs
# this solution was from Dan Zhang (thanks Dan!)
charToRaw(test1)
## [1] 46 72 65 6e 63 68 c2 a0 42 75 6c 6c 64 6f 67 73
charToRaw(test2)
## [1] 46 72 65 6e 63 68 20 42 75 6c 6c 64 6f 67 73
# test1 has c2a0 for the space, and test2 has 20
# c2a0 is a weird type of space, and 20 is the regular space
# if you actually download and open the csvs in a program like excel
# you will see some weird unicode nonsense (my fault)
Trying another way to join. Since we can check that the dfs are
ordered the same way, we can join by using bind_cols().
# create df of just 2020 rank to add to bind to scores
rank2020 <- breed_rank_all %>%
select(`2020 Rank`)
# bind to scores
# joining is tough here because it doesn't handle spaces well
# here i made sure that both dfs were ordered exactly the same way
scores_popularity <- bind_cols(scores, rank2020)
scores_popularity_top10 <- scores_popularity %>%
filter(`2020 Rank` <= 10)
By creating a new df and plotting
# creating a new df and plotting
scores_popularity %>%
ggplot() +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point(aes(x = PC1, y = PC2)) +
geom_point(data = scores_popularity_top10,
aes(x = PC1, y = PC2), color = "darkcyan") +
geom_label_repel(data = scores_popularity_top10,
aes(x = PC1, y = PC2, label = Breed),
color = "darkcyan", size = 2.5) +
theme_minimal() +
theme(legend.position = "none") +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Scores Plot of American Kennel Club Dog Trait Data",
subtitle = "Labelled points are the top 10 most popular breeds from 2020")

Without creating a new df, and using if_else()
# without creating a new df and if_else
scores_popularity %>%
ggplot(aes(x = PC1, y = PC2)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point() +
geom_point(color = if_else(scores_popularity$`2020 Rank` <= 10, "darkcyan", "black")) +
geom_label_repel(aes(label = if_else(`2020 Rank` <= 10, Breed, NULL)),
size = 2.5, color = "darkcyan") +
theme_minimal() +
theme(legend.position = "none") +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Scores Plot of American Kennel Club Dog Trait Data",
subtitle = "Labelled points are the top 10 most popular breeds from 2020")
## Warning: Removed 185 rows containing missing values (geom_label_repel).

Using gghighlight()
# using gghighlight
scores_popularity %>%
ggplot() +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point(aes(x = PC1, y = PC2)) +
gghighlight(`2020 Rank` <= 10, use_direct_label = TRUE, label_key = Breed) +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Scores Plot of American Kennel Club Dog Trait Data",
subtitle = "Labelled points are the top 10 most popular breeds from 2020")
## Warning: Could not calculate the predicate for layer 1, layer 2; ignored

---
title: "Module 4 Assignment"
author: "You"
date: "Due 11-30-2022"
output:
  html_document:
    toc: true
    toc_depth: 5
    toc_float: true
    theme: flatly
    code_download: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction
This is your assignment for Module 4 Putting It All Together, focused on the material you learned in the lectures and recitation activities on [PCA](4_09_PCA/09_pca.html), [Manhattan plots](4_10_manhattan/10_manhattan.html), [interactive plots](4_11_interactive_plots/4_11_interactive_plots.html), and the [leftovers](4_12_leftovers/4_12_leftovers.html)

Submission info:

- Please submit this assignment by uploading a **knitted .html** to Carmen
- Your headers should be logical and your report and code annotated with descriptions of what you're doing. Starting on this assignment, I will be considering for overall format and readability of your assignment as part of your grade. I am doing this because the format of your report will be considered for your final capstone assignment. This means you should have reasonable headers and header levels, understandable flow between plots and code, and use Markdown language when appropriate.
- Make sure you include the Code Download button so that I can see your code as well
- Customize the YAML and the document so you like how it looks

Remember there are often many ways to reach the same end product. I have showed you many ways in class to achieve a similar end product, you only need to show me one of them. As long as your answer is reasonable, you will get full credit even if its different than what I intended.

> This assignment will be due on Wednesday, November 30, 2022, at 11:59pm.

### Data
The [data](https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-02-01/readme.md) we will be using is the same we used in the [ggplot102 recitation](2_05_themes_labels_facets/05_ggplot102_recitation.html) that includes information about dog breed trait information from the American Kennel Club.

Download the data using the code below. Don't use the code from week 5 recitation.
```{r, message = FALSE, warning = FALSE}
breed_traits <- readr::read_csv('data/breed_traits_fixed.csv')

trait_description <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/trait_description.csv')

breed_rank_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_rank.csv')
```

For a little hint, here are the packages I used to complete this task. Yours might not be exactly the same.
```{r, message = FALSE, warning = FALSE}
library(tidyverse)
library(factoextra)
library(glue)
library(patchwork)
library(ggrepel)
library(plotly)
library(gghighlight)
```


## 1. Principal components analysis (PCA) of American Kennel Club dog bred trait data (6 pts)

Run a PCA on `breed_traits` for all of the numeric data present in that dataset. Create the following plots and make them of publication quality:

1. A scree plot
2. A scores plot
3. A loadings plot
4. A two panel plot that has the scores plot and the scree plot together

```{r}
breed_traits_quant <- breed_traits %>%
  select(-`Coat Type`, -`Coat Length`)

# run PCA
# no scaling (because all are on the same scale)
# centering is a good idea
trait_pca <- prcomp(breed_traits_quant[,-1],
                    scale = FALSE,
                    center = TRUE)

importance <- summary(trait_pca)$importance %>%
  as.data.frame()
```

### Scree plot

#### With `fviz_eig()`
```{r}
fviz_eig(trait_pca)
```

#### Manually
```{r}
importance_tidy <- importance %>%
  rownames_to_column(var = "measure") %>%
  pivot_longer(cols = PC1:PC10,
               names_to = "PC",
               values_to = "value")

# create a vector with the order we want
my_order <- colnames(importance)

# relevel according to my_order
importance_tidy$PC <- fct_relevel(importance_tidy$PC, levels = my_order)

# plot
(scree_plot <- importance_tidy %>%
  filter(measure == "Proportion of Variance") %>%
  ggplot(aes(x = PC, y  = value)) +
  geom_col(alpha = 0.1, color = "black") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  labs(x = "Principal component",
       y = "Percent variance explained",
       title = "Scree plot of dog traits"))
```

### Scores plot

#### With `fviz_pca_ind()`
```{r}
fviz_pca_ind(trait_pca)
```

#### Manually
```{r}
# create a df of trait_pca$x
scores_raw <- as.data.frame(trait_pca$x)

# bind breed name
scores <- bind_cols(breed_traits[,1], # first column where we have breed name
                    scores_raw)

# create objects indicating percent variance explained by PC1 and PC2
PC1_percent <- round((importance[2,1])*100, # index 2nd row, 1st column, times 100
                     1) # round to 1 decimal
PC2_percent <- round((importance[2,2])*100, 1) 

# plot
(scores_plot <- scores %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_point(color = "black") +
  theme_minimal() +
  labs(x = glue("PC1: {PC1_percent}%"), 
       y = glue("PC2: {PC2_percent}%"), 
       title = "PCA Scores Plot of American Kennel Club Dog Trait Data"))
```

### Loadings plot
#### With `fviz_pca_var()`
```{r}
fviz_pca_var(trait_pca)
```


#### Manually
```{r}
# grab raw loadings, without any metadata
loadings_raw <- as.data.frame(trait_pca$rotation)

# move rowname to column
loadings <- loadings_raw %>%
  rownames_to_column(var = "Trait")

(loadings_plot <- loadings %>%
  ggplot(aes(x = PC1, y = PC2, label = Trait)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +  
  geom_point() +
  geom_label_repel(size = 2.5) +
  theme_minimal() +
  labs(x = glue("PC1: {PC1_percent}%"), 
       y = glue("PC2: {PC2_percent}%"), 
       title = "PCA Loadings Plot of American Kennel Club Dog Trait Data"))
```

### Scree and scores plots
```{r}
scree_plot + scores_plot
```


## 2. Make your PCA plot interactive (2 pts)

Make your PCA scores plot interactive, and so that when you hover each point, you can see what the name of that dog breed is (and only the breed of that dog).

```{r, message = FALSE, warning = FALSE}
scores_plotly <- scores %>%
  ggplot(aes(x = PC1, y = PC2, text = Breed)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_point(color = "black") +
  theme_minimal() +
  labs(x = glue("PC1: {PC1_percent}%"), 
       y = glue("PC2: {PC2_percent}%"), 
       title = "PCA Scores Plot of American Kennel Club Dog Trait Data")

ggplotly(scores_plotly, tooltip = "text")
```

## 3. See how your PCA related to breed popularity (2 pts)

Using `breed_traits` and `breed_rank_all`, label the points that show data for the top 10 dog breeds in 2020 and color them different from the rest of the points. Your plot does not need to be interactive.

Try joining the dfs based on Breed
```{r}
# grab just breed and the rank in 2020
breed_rank_to_join <- breed_rank_all %>%
  select(Breed, `2020 Rank`)

# join with breed_traits  
joined <- left_join(breed_traits, breed_rank_to_join,
                    by = "Breed")

# check
knitr::kable(head(joined))
```

We have a bunch of NAs in `2020 Rank` lets figure out why. I notice we have NAs in all the Breed names that have spaces
```{r}
# create a vector of breed from each df
breed_traits_breed <- breed_traits$Breed
breed_rank_breed <- breed_rank_all$Breed

# are they the same?
all.equal(breed_traits_breed, breed_rank_breed) # no, many string mismatches

# testing with French Bulldogs bc those are my fav
test1 <- breed_traits_breed[2]
test2 <- breed_rank_breed[2]

library(stringi) # has the function stri_compare()
stri_compare(test1, test2) # 1 string different, but ok which one?

library(devtools) # allows you to use install_github
install_github("renozao/pkgmaker") # install pkgmaker for string comparison
library(pkgmaker)

str_diff(test1, test2) # the problem is in the space! But what? It looks the same to me

# this solution was from Dan Zhang (thanks Dan!)
charToRaw(test1)
charToRaw(test2)
# test1 has c2a0 for the space, and test2 has 20
# c2a0 is a weird type of space, and 20 is the regular space
# if you actually download and open the csvs in a program like excel
# you will see some weird unicode nonsense (my fault)


```


Trying another way to join. Since we can check that the dfs are ordered the same way, we can join by using `bind_cols()`.
```{r}
# create df of just 2020 rank to add to bind to scores
rank2020 <- breed_rank_all %>%
  select(`2020 Rank`)

# bind to scores
# joining is tough here because it doesn't handle spaces well
# here i made sure that both dfs were ordered exactly the same way
scores_popularity <- bind_cols(scores, rank2020)

scores_popularity_top10 <- scores_popularity %>%
  filter(`2020 Rank` <= 10)
```

By creating a new df and plotting
```{r}
# creating a new df and plotting
scores_popularity %>%
  ggplot() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_point(aes(x = PC1, y = PC2)) +
  geom_point(data = scores_popularity_top10,
             aes(x = PC1, y = PC2), color = "darkcyan") +  
  geom_label_repel(data = scores_popularity_top10,
                  aes(x = PC1, y = PC2, label = Breed), 
                  color = "darkcyan", size = 2.5) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(x = glue("PC1: {PC1_percent}%"), 
       y = glue("PC2: {PC2_percent}%"), 
       title = "PCA Scores Plot of American Kennel Club Dog Trait Data",
       subtitle = "Labelled points are the top 10 most popular breeds from 2020")
```

Without creating a new df, and using `if_else()`
```{r}
# without creating a new df and if_else
scores_popularity %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_point() +
  geom_point(color = if_else(scores_popularity$`2020 Rank` <= 10, "darkcyan", "black")) +
  geom_label_repel(aes(label = if_else(`2020 Rank` <= 10, Breed, NULL)),
                   size = 2.5, color = "darkcyan") +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(x = glue("PC1: {PC1_percent}%"), 
       y = glue("PC2: {PC2_percent}%"), 
       title = "PCA Scores Plot of American Kennel Club Dog Trait Data",
       subtitle = "Labelled points are the top 10 most popular breeds from 2020")
```

Using `gghighlight()`
```{r}
# using gghighlight
scores_popularity %>%
  ggplot() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_point(aes(x = PC1, y = PC2)) +
  gghighlight(`2020 Rank` <= 10, use_direct_label = TRUE, label_key = Breed) +
  theme_minimal() +
  labs(x = glue("PC1: {PC1_percent}%"), 
       y = glue("PC2: {PC2_percent}%"), 
       title = "PCA Scores Plot of American Kennel Club Dog Trait Data",
       subtitle = "Labelled points are the top 10 most popular breeds from 2020")
```


